computer control
AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants
Sager, Pascal J., Meyer, Benjamin, Yan, Peng, von Wartburg-Kottler, Rebekka, Etaiwi, Layan, Enayati, Aref, Nobel, Gabriel, Abdulkadir, Ahmed, Grewe, Benjamin F., Stadelmann, Thilo
Instruction-based computer control agents (CCAs) execute complex action sequences on personal computers or mobile devices to fulfill tasks using the same graphical user interfaces as a human user would, provided instructions in natural language. This review offers a comprehensive overview of the emerging field of instruction-based computer control, examining available agents -- their taxonomy, development, and respective resources -- and emphasizing the shift from manually designed, specialized agents to leveraging foundation models such as large language models (LLMs) and vision-language models (VLMs). We formalize the problem and establish a taxonomy of the field to analyze agents from three perspectives: (a) the environment perspective, analyzing computer environments; (b) the interaction perspective, describing observation spaces (e.g., screenshots, HTML) and action spaces (e.g., mouse and keyboard actions, executable code); and (c) the agent perspective, focusing on the core principle of how an agent acts and learns to act. Our framework encompasses both specialized and foundation agents, facilitating their comparative analysis and revealing how prior solutions in specialized agents, such as an environment learning step, can guide the development of more capable foundation agents. Additionally, we review current CCA datasets and CCA evaluation methods and outline the challenges to deploying such agents in a productive setting. In total, we review and classify 86 CCAs and 33 related datasets. By highlighting trends, limitations, and future research directions, this work presents a comprehensive foundation to obtain a broad understanding of the field and push its future development.
A Zero-Shot Language Agent for Computer Control with Structured Reflection
Li, Tao, Li, Gang, Deng, Zhiwei, Wang, Bryan, Li, Yang
Large language models (LLMs) have shown increasing capacity at planning and executing a high-level goal in a live computer environment (e.g. MiniWoB++). To perform a task, recent works often require a model to learn from trace examples of the task via either supervised learning or few/many-shot prompting. Without these trace examples, it remains a challenge how an agent can autonomously learn and improve its control on a computer, which limits the ability of an agent to perform a new task. We approach this problem with a zero-shot agent that requires no given expert traces. Our agent plans for executable actions on a partially observed environment, and iteratively progresses a task by identifying and learning from its mistakes via self-reflection and structured thought management. On the easy tasks of MiniWoB++, we show that our zero-shot agent often outperforms recent SoTAs, with more efficient reasoning. For tasks with more complexity, our reflective agent performs on par with prior best models, even though previous works had the advantages of accessing expert traces or additional screen information.
Why we need human-centered AI
Welcome to AI book reviews, a series of posts that explore the latest literature on artificial intelligence. There are two contrasting but equally disturbing images of artificial intelligence. One warns about a future in which runaway intelligence becomes smarter than humanity, creates mass unemployment, and enslaves humans in a Matrix-like world or destroys them a la Skynet. A more contemporary image is one in which dumb AI algorithms are entrusted with sensitive decisions that can cause severe harm when they do go wrong. What both visions have in common is the absence of human control.
The case for human-centered AI
Welcome to AI book reviews, a series of posts that explore the latest literature on artificial intelligence. There are two contrasting but equally disturbing images of artificial intelligence. One warns about a future in which runaway intelligence becomes smarter than humanity, creates mass unemployment, and enslaves humans in a Matrix-like world or destroys them a la Skynet. A more contemporary image is one in which dumb AI algorithms are entrusted with sensitive decisions that can cause severe harm when they do go wrong. What both visions have in common is the absence of human control.
DeepMind Trains Agents to Control Computers as Humans Do to Solve Everyday Tasks
While the design and development of contemporary AI systems have been largely results-oriented, there are also scenarios where it could be advantageous if models learned to do things "as a human would" to help with everyday tasks. That's the premise of the new DeepMind paper A Data-driven Approach for Learning To Control Computers, which proposes agents that can operate our digital devices via keyboard and mouse with goals specified in natural language. The study builds on recent developments in natural language processing, code production, and multimodal interactive behaviour in 3D simulated worlds that have enabled the generation of models with remarkable domain knowledge and desirable human-agent interaction capabilities. The proposed agents are trained on keyboard and mouse computer control for specific tasks with pixel and Document Object Model (DOM) observations, and achieve state-of-the-art and human-level mean performance across all tasks on the MiniWob benchmark. MiniWob is a challenging suite of web-browser-based tasks for computer control, ranging from simple button clicking to complex form filling.
The Self-Driving Car Is a Red Herring - Issue 92: Frontiers
Ten years ago this fall, Google gave us a glimpse of a new device unlike any it had ever built before--a computer-controlled car. It seemed such a strange thing for an Internet company to spend its time and energy on, a "moonshot" as the company's engineers called such massive efforts. But with a single blog post, the search giant promised to reinvent our cars, and our communities, too. It was a big vision for a single invention to carry. And the details were scant. But we quickly filled in the blanks. Software was going to replace our dangerous, congested, sprawling roads with something utterly safe, seamless and organized. Humans would take the back seat in a new network of "ghost roads," as I call them.
Receptive Environments for Artificial Intelligence
Related to the need for matching the corporate culture is the necessity for having a receptive environment for a project based on AI technology. In addition to the obvious importance of management support, developers and intended users must also become enthusiastic, or at least not antagonistic. During a tour of a heavy manufacturing facility, an AI development group was studying the feasibility of incorporating knowledge systems into the factory. At one point, they came across an old-time machinist who was operating a large metal-cutting machine equipped with extensive computer controls. However, the machinist was not using the computer controls at all; he even ignored the LCD display.
Robots could soon walk like HUMANS
Footage of robots falling over recently caused hilarity on social media. While we may be impressed by their artificial intelligence, humanoids often have an awkward, stumbling gait. Now scientists have developed a new system that they say will allow future robots to walk in the same way as humans, and avoid being knocked over easily. The technology could allow robots to one day take over human jobs, such as serving in the armed forces or doing household chores, the researchers claim.
Creepy RatCar could drive mobility research
The cyborg armies of the future just got one step closer to total domination. Probably by taking a break from building giant fighting robots, scientists at the University of Tokyo have created the RatCar, a wheeled contraption controlled by a rat's brain. The researchers wanted to prove a simple idea right: that animals could use the parts of their brains that control limbs to control a vehicle. It looks like they can. The goal of the research was to see if it might eventually be feasible for paralyzed people to control wheelchairs using brain implants.